Given that I performed my last analysis as an exercise to teach myself Python and Matplotlib, I wanted to clean it up, abstracting some of the distracting code into a helper file and trying out some new data visualization libraries I've come across. I also wanted to try my hand at transforming my web crawler from a synchronous program to an asynchronous one, which cut the scraping runtime by roughly 50% (check out the Performance Enhancement section below).
%load_ext autoreload
%autoreload 2
import pandas as pd
import ssl
import atla_func as scrape
from bs4 import BeautifulSoup
import requests
import hvplot
import httpx
import asyncio
from bs4 import BeautifulSoup
import numpy as np
import matplotlib.pyplot as plt
import hvplot.pandas
ssl._create_default_https_context = ssl._create_unverified_context
gaang = [character.lower() for character in
["Aang", "Katara", "Sokka", "Suki", "Toph", "Zuko", "Iroh", "Mai", "Ty Lee", "Azula", 'other']]
90% of the battle in data analysis/visualization is selecting the appropriate dataset. Unfortunately, the website I used last time is now defunct (I sure hope my previous web scraper didn't kill it...). However, the episode tables and transcripts are now hosted on wiki servers, and presumably these should be more robust.
ep_df = pd.read_html('https://en.wikipedia.org/wiki/List_of_Avatar:_The_Last_Airbender_episodes')
ep_info_lst = []
for season in range(1, 4):
    szn_count = 1
    # [1:-1] removes quotation marks from the string; alternative is .Title.str.replace('"', ""), but str methods are slow
    for title in ep_df[season]["Title"].apply(lambda title: title[1:-1]):
        ep_info_lst.append(scrape.create_episode_dict(season, szn_count, title, gaang))
        szn_count += 1
ep_df = pd.DataFrame(ep_info_lst).drop(columns='Dialogue')
ep_df
| Season | Episode | Title | |
|---|---|---|---|
| 0 | 1 | 1 | The Boy in the Iceberg |
| 1 | 1 | 2 | The Avatar Returns |
| 2 | 1 | 3 | The Southern Air Temple |
| 3 | 1 | 4 | The Warriors of Kyoshi |
| 4 | 1 | 5 | The King of Omashu |
| ... | ... | ... | ... |
| 56 | 3 | 17 | The Ember Island Players |
| 57 | 3 | 18 | Sozin's Comet, Part 1: The Phoenix King |
| 58 | 3 | 19 | Sozin's Comet, Part 2: The Old Masters |
| 59 | 3 | 20 | Sozin's Comet, Part 3: Into the Inferno |
| 60 | 3 | 21 | Sozin's Comet, Part 4: Avatar Aang |
61 rows × 3 columns
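The `[1:-1]` slice used above strips the quotation marks Wikipedia wraps around each episode title; a quick illustration:

```python
# Wikipedia wraps each episode title in quotation marks;
# slicing off the first and last character removes them.
titles = ['"The Boy in the Iceberg"', '"The Avatar Returns"']
cleaned = [t[1:-1] for t in titles]
# cleaned == ['The Boy in the Iceberg', 'The Avatar Returns']
```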
You may notice that the function below uses await - by modifying the HTML-grabbing functions to be asynchronous instead of synchronous, we actually reduce the amount of time spent web scraping by 50%-70%, depending on the internet connection at the time the function is called.
See the Performance Enhancement Section at the end of this analysis
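The concurrent-fetch pattern behind `make_atla_df` can be sketched as follows. This is a minimal illustration, not the helper's actual code: `fetch_page` stands in for an `httpx.AsyncClient.get` call, with `asyncio.sleep` simulating network latency.

```python
import asyncio

async def fetch_page(url: str) -> str:
    # Stand-in for `await client.get(url)` with an httpx.AsyncClient;
    # the sleep simulates one network round-trip.
    await asyncio.sleep(0.1)
    return f"<html>{url}</html>"

async def fetch_all(urls):
    # gather() schedules every request at once, so total wall time is
    # roughly one round-trip instead of len(urls) round-trips.
    return await asyncio.gather(*(fetch_page(u) for u in urls))

urls = [f"https://example.com/transcript/{i}" for i in range(5)]
pages = asyncio.run(fetch_all(urls))
```

In a script you start the event loop with `asyncio.run`; in a notebook a loop is already running, which is why the cells here can simply `await` the coroutine at the top level.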
# dialogue_df = await scrape.make_atla_df(ep_df, gaang, other=True)
# dialogue_df.to_pickle('dialogue_df.pkl')
dialogue_df = pd.read_pickle('dialogue_df.pkl')
dialogue_df.head(2)
| Season | Episode | Title | aang | katara | sokka | suki | toph | zuko | iroh | mai | ty lee | azula | other | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | The Boy in the Iceberg | [I need to ask you something ..., Please ... c... | [Sokka, look!, But, Sokka! I caught one!, Hey!... | [It's not getting away from me this time. Wat... | [] | [] | [Finally! Uncle, do you realize what this mea... | [I won't get to finish my game?, Or it's just ... | [] | [] | [] | [Well, no one has seen an airbender in a hundr... |
| 1 | 1 | 2 | The Avatar Returns | [Yeah. We were on the ship and there was this ... | [Aang didn't do anything! It was an accident.,... | [I knew it! You signaled the Fire Navy with t... | [] | [] | [Where are you hiding him?, He'd be about this... | [Hey, you mind taking this to his quarters for... | [] | [] | [] | [Yay! Aang's back!, Katara, you shouldn't have... |
I've learned about some new libraries since last time and think that the data trends I tried visualizing with matplotlib line plots could be made more interactive and informative.
On the wordcloud front, I can use my space more efficiently if I organize the wordclouds in a grid instead of scrolling endlessly past individually plotted figures.
A few thoughts I had about how I could improve over the previous iteration of graphs:
import numpy as np
import hvplot.pandas # noqa
# import hvplot.dask # noqa
ax1 = hvplot.plot(dialogue_df[gaang].applymap(lambda x: scrape.count_words(x)).sum().sort_values(ascending=False).rename("# words"),
kind='bar',
title='Total Words Spoken',
ylabel='# Words',
width=300,
height=400,
rot=60,
dynamic=False
).opts(color='index', cmap=scrape.color_dict, show_legend=False, axiswise=True)
gaang_for_analysis=["aang", "katara", "sokka", "toph", "zuko", "suki"]
tmp = dialogue_df[gaang_for_analysis].applymap(lambda x: scrape.count_words(x))
ax2 = hvplot.plot(tmp.div(tmp.sum(axis=1), axis=0).join(ep_df.Title),
kind='bar',
stacked=True,
title='Character Dialogue % per Episode',
xlabel='Episode Number',
ylabel='% of Episode Dialogue',
width=650,
height=400,
rot=60,
bar_width=1.,
line_alpha=0,
hover_alpha=.5,
dynamic=False
).opts(cmap=scrape.color_dict, axiswise=True, show_legend=True, fontsize={'xticks': 6})
ax1 + ax2
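The `tmp.div(tmp.sum(axis=1), axis=0)` step above converts raw word counts into per-episode shares; a toy example with made-up counts:

```python
import pandas as pd

# Hypothetical per-episode word counts (rows = episodes, columns = characters).
tmp = pd.DataFrame({"aang": [30, 10], "katara": [10, 30]})

# Divide each row by its row total: every row of `shares` now sums to 1,
# giving each character's share of that episode's dialogue.
shares = tmp.div(tmp.sum(axis=1), axis=0)
```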
A few interesting things this view shows us that were harder to see in the previous iterations of the graphs:
fig, ax = plt.subplots(2, 3, figsize=(20, 15));
wordcloud_gaang = ["aang", "katara", "sokka", "toph", "zuko", 'iroh']
gaang_images = ["Aang Photoshop.jpg", "Katara Photoshop.jpg", "Sokka Photoshop.jpg", "Toph Photoshop.jpg",
"Zuko Photoshop.jpg", 'Iroh Photoshop.jpg']
zipdict = zip(wordcloud_gaang, gaang_images)
for row in ax:
    for subplot in row:
        character, av_image = next(zipdict)
        scrape.make_wordcloud(subplot, dialogue_df, character, av_image)
While most of the code was overhauled between the previous analysis and this one, the bulk of the change was driven by necessity, as the website that hosted the previous scripts no longer exists, and scraping the wiki's transcript pages required a different approach.
The core exception to this is in pulling the script data. Whereas the prior analysis utilized a synchronous process to pull each page's script, the code has been updated to utilize an asynchronous approach, improving the speed of execution. In other words, in the previous web crawler, I could not pull script 2 until I had finished pulling script 1. With this new approach, I can pull script 1 and 2 at the same time!
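A self-contained way to see the effect (with `asyncio.sleep` standing in for each page request, since the real timings depend on the network):

```python
import asyncio
import time

async def pull_script(n: int) -> int:
    await asyncio.sleep(0.2)  # stand-in for one transcript request
    return n

async def sequential():
    # Each request waits for the previous one: roughly 0.2s * 3 total.
    return [await pull_script(i) for i in range(3)]

async def concurrent():
    # All requests are in flight at once: roughly 0.2s total.
    return await asyncio.gather(*(pull_script(i) for i in range(3)))

t0 = time.perf_counter(); asyncio.run(sequential()); seq_time = time.perf_counter() - t0
t0 = time.perf_counter(); asyncio.run(concurrent()); conc_time = time.perf_counter() - t0
```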
The performance increase results in the new code executing 2-4x as fast as the previous iteration; the exact performance increase fluctuates (presumably internet connection plays a factor, as web scraping is involved). Compare the runtime values in the execution profiles between the old method (usually 40-60s) and the new method (~15s):
*Note: Performance times may be slightly inflated; these were pulled over a mobile hotspot. Proportions should still be comparable.*
import cProfile
import pstats
with cProfile.Profile() as pr_old:
    ep_info_lst = []
    for episode_num in range(len(ep_df)):
        url = f"http://avatar.fandom.com/wiki/Transcript:{ep_df.Title.iloc[episode_num]}"
        ep_dict = scrape.create_episode_dict(ep_df.Season.iloc[episode_num], ep_df.Episode.iloc[episode_num], ep_df.Title.iloc[episode_num], gaang)
        scrape.get_atla_dialogue(ep_dict, url, character_list=gaang, other=True)
        ep_info_lst.append(ep_dict)
    df_old = pd.DataFrame(ep_info_lst)
    df_old = df_old.drop("Dialogue", axis=1).join(df_old["Dialogue"].apply(pd.Series))
print('OLD METHOD\n')
old_stats = pstats.Stats(pr_old)
old_stats.sort_stats(pstats.SortKey.TIME)
old_stats.print_stats('atla_func')
with cProfile.Profile() as pr_async:
    df_new = await scrape.make_atla_df(ep_df, gaang, other=True)
print("NEW METHOD\n")
stats = pstats.Stats(pr_async)
stats.sort_stats(pstats.SortKey.TIME)
stats.print_stats('atla_func')
OLD METHOD
20182872 function calls (20180657 primitive calls) in 41.098 seconds
Ordered by: internal time
List reduced from 1216 to 7 due to restriction <'atla_func'>
ncalls tottime percall cumtime percall filename:lineno(function)
61 0.041 0.001 1.028 0.017 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:60(get_character_dialogue)
9993 0.006 0.000 0.034 0.000 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:56(clean_dialogue)
9993 0.005 0.000 0.024 0.000 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:51(clean_square_brackets)
61 0.005 0.000 41.070 0.673 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:86(get_atla_dialogue)
61 0.001 0.000 40.037 0.656 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:42(crawl_page)
61 0.000 0.000 0.000 0.000 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:37(<dictcomp>)
61 0.000 0.000 0.000 0.000 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:27(create_episode_dict)
NEW METHOD
20097637 function calls (20092044 primitive calls) in 11.369 seconds
Ordered by: internal time
List reduced from 1503 to 11 due to restriction <'atla_func'>
ncalls tottime percall cumtime percall filename:lineno(function)
61 0.041 0.001 1.036 0.017 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:60(get_character_dialogue)
28 0.009 0.000 10.020 0.358 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:109(make_atla_df)
9993 0.006 0.000 0.034 0.000 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:56(clean_dialogue)
9993 0.005 0.000 0.024 0.000 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:51(clean_square_brackets)
61 0.000 0.000 9.950 0.163 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:104(get_atla_dialogue_new)
61 0.000 0.000 0.000 0.000 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:37(<dictcomp>)
61 0.000 0.000 0.000 0.000 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:27(create_episode_dict)
1 0.000 0.000 0.022 0.022 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:102(<listcomp>)
28 0.000 0.000 0.032 0.001 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:98(grab_htmls)
62 0.000 0.000 0.000 0.000 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:100(<genexpr>)
61 0.000 0.000 0.000 0.000 C:\Users\KyleK\OneDrive\3_Computer_Programming\DS-Projects\atla_func.py:113(<lambda>)
<pstats.Stats at 0x1ee1da470a0>
# Prove the methods are equivalent
df_old.equals(df_new)
True